The standardization body ISO/IEC JTC1/SC29/WG11 (Moving Pictures Experts Group, 
MPEG) was drafting a standard for compressing the high bit rate of moving pictures and 
associated audio down to 1.5 Mbit/s. The audio part of the proposed standard is described. 
Three layers of the audio coding scheme with increasing complexity and performance 
were defined. These layers were developed in collaboration mainly with AT&T, CCETT, 
FhG/University of Erlangen, Philips, IRT, and Thomson Consumer Electronics. The 
generic coding system is suitable for different applications, such as storage on inexpensive 
storage media or transmission over channels with limited capacity (such as digital audio 
broadcasting or ISDN audio transmission). 



0 INTRODUCTION 

The necessity to specify a generic video and audio 
coding scheme for many applications dealing with digi- 
tally coded video and audio and requiring low data rates 
has led the ISO/IEC standardization body to establish 
the ISO/IEC JTC1/SC29/WG11, called MPEG (Moving 
Pictures Experts Group). This group had the task to 
compare and assess several digital audio low-bit-rate 
coding techniques in order to develop an international 
standard for the coded representation of moving pic- 
tures, associated audio, and their combination when 
used for storage and retrieval on digital storage media 
(DSM). The DSM targeted by MPEG include CD-ROM, 
DAT, magneto-optical disks, and computer disks, and 
it is expected that MPEG-based bit-rate reduction tech- 
niques will be used in a variety of communication chan- 
nels such as ISDN and local area networks and in broad- 
casting applications. The international standard ISO/IEC 
11172 "Coding of Moving Pictures and Associated 
Audio for Digital Storage Media at up to about 1.5 Mbit/s" 
was finalized in November 1992 and consists of three 
parts: system, video, and audio [1]. The system part 
(11172-1) deals with synchronization and multiplexing 
of audio-visual information, whereas the video (11172-2) 
and audio (11172-3) parts address the video and the 
audio bit-rate reduction techniques, respectively. This 
standard is also known as the MPEG-1 standard. 

MPEG-2 Audio is the logical extension from two 
to five audio channels, providing backward compatibility 
with MPEG-1. The main aspects are high quality of five 
(+1) audio channels, low bit rate, and backward 
compatibility, the key to ensuring that existing 2-channel 
decoders will still be able to decode compatible stereo 
information from five (+1) multichannel signals. 

This standard, which is expected in November 1994, 
is based on standards and recommendations from inter- 
national organizations such as ITU-R, SMPTE, and 
EBU. International standardization bodies will ensure 
the highest audio signal quality by extensive testing. 
For audio reproduction the loudspeaker positions left, 
center, right, left and right surround are used, according 
to the 3/2-standard. 

1 STANDARDIZATION AND QUALITY 
ASSESSMENTS WITHIN MPEG-1 AUDIO 

Since 1988 ISO/MPEG has been undertaking the stan- 
dardization of compression techniques for video and as- 
sociated audio. The main topic for standardization in 
MPEG was video coding together with audio coding 
for DSM. On the other hand the audio coding standard 
developed by this group was the first international stan- 
dard in the field of digital audio compression and is 
expected to be followed in different applications. Alongside 
several other subgroups such as video, system, test, 
implementation, requirement, and DSM, the audio subgroup 
of MPEG had the responsibility for developing a 
standard for coding of PCM audio signals with sampling 
rates of 32, 44.1, and 48 kHz at bit rates in a range of 
32-192 kbit/s per mono and 64-384 kbit/s per stereo 
audio channel. The operating modes are 

• Single channel 

• Dual channel, like bilingual 

• Stereo 

• Joint stereo (combined coding of left and right chan- 
nels of a stereophonic audio program) 

Table 1 gives a short overview of the milestones 
of the MPEG/Audio group. This group asked for 
proposals for the audio coding standard in mid-1989, and 14 
proposals were submitted for this purpose. The original 
proposals were grouped into four clusters according to 
algorithmic similarities. The clustered candidate 
algorithms were called ASPEC, ATAC, MUSICAM, and 
SB/ADPCM. 

A number of subjective tests have been performed [2]-[4] 
since mid-1990 to assess the audio quality of the ISO/ 
MPEG/Audio coding standard. During this time period 
several improvements were made to reach the present 
audio quality. The important milestones in the 
development of the standard were the official tests 
organized by the Swedish Broadcasting Corporation in 
Stockholm under the auspices of ISO and EBU. In July 
1990 large listening tests and objective evaluations, 
covering basic audio quality at different bit rates, sensitivity to 
transmission bit errors, encoder and decoder complexity, 
and coding delay, were performed on prototype real-time 
implementations of the four clustered algorithms. 
Both the ASPEC and MUSICAM proposals showed 
a very high subjective quality at bit rates of about 100 
kbit/s per channel. 

Because the proposals of the ASPEC and MUSICAM 
groups were subjectively rated nearly equally and were 
judged relatively close in their overall performance, 
the official decision was as follows [5]: 

... the MPEG standardization committee decided to 
approve a collaborative development of the draft audio 
coding standard between the ASPEC and MUSICAM 
groups, because the ASPEC codec was slightly superior 
with respect to the audio quality, especially for lower 
bit rates (64 kbit/s/channel), and the MUSICAM codec 
was slightly superior with respect to implementation 
complexity and decoding delay. The decision was that 
MUSICAM should be the basis for the low-complexity 
first layer, and algorithmic refinements including contri- 
butions of ASPEC should be used in the subsequent 
layers. 

Table 1. Milestones of ISO/MPEG-Audio group during the 
development of audio part of the International Standard 11172. 

Date                         Activities 

1988 December                First audio meeting in Hanover 
1989 January to 1990 March   Preparation of tests 
1989 May                     Determining requirements and weighting procedure 
1989 June                    Proposal of 14 algorithms to be tested 
1989 October                 Clustering of proponents into four groups 
1989 December                Detailed description of four clustered proposals: 
                             ASPEC, ATAC, MUSICAM, and SB-ADPCM 
1990 May                     Exchange of tapes with coded audio sequences 
                             between four clusters 
1990 June                    Subjective and objective tests at SR, Stockholm 
1990 August                  Presentation of results and decision to follow 
                             a layer concept 
1990 December                First draft of part 3, "Audio Coding" of 
                             International Standard ISO 11172 was prepared. 
1991 May                     Verification of three layers by subjective 
                             testing, again at SR in Stockholm 
1991 June                    Layers I and II are frozen; Layer III and 'joint 
                             stereo coding' are still under discussion 
1991 November                Second verification of Layer III and first 
                             checking of Joint Stereo Coding by subjective 
                             testing at NDR in Hanover 
1991 December                Draft of International Standard (DIS) ready for 
                             balloting at national standardization bodies 
1992 November                International Standard ISO/IEC 11172-3 accepted 
                             by national standardization bodies 

A three-layer coding algorithm has been defined. These 
three layers were tested again in April 1991 by the 
Swedish Broadcasting Corporation [3], and a last verification 
test for the very low bit rate of 64 kbit/s/channel and 
"joint stereo coding" was carried out by the University 
of Hanover under the auspices of NDR in November 
1991 [4]. In November 1991 the final proposal, 
consisting of three modes of operation called "Layers," was 
adopted by ISO/MPEG [6]. 

2 BASIC STRUCTURE OF A GENERIC AUDIO 
CODING SCHEME USING PERCEPTUAL 
CRITERIA 

The basic structure of a perceptual audio coding 
scheme is shown in Fig. 1. 

1) A time-frequency mapping (filter bank) is used 
to decompose the input signal into subsampled spectral 
components. Depending on the filter bank used, these 
are called subband values or frequency lines. 

2) The output of this filter bank, or the output of a 
parallel transform, is used to calculate an estimate of the 
actual (time-dependent) masking threshold using rules 
known from psychoacoustics. 

3) The subband samples or frequency lines are quan- 
tized and coded with the aim of keeping the noise, which 
is introduced by quantizing, below the masking thresh- 
old. Depending on the algorithm, this step is done in 
very different ways. The complexity varies from block 
companding to analysis-by-synthesis systems using ad- 
ditional noiseless compression. 

4) A frame packing is used to assemble the bit stream, 
which typically consists of the quantized and coded 
mapped samples and some side information, such as bit 
allocation information. 

Depending on whether the focus is on low frequency 
resolution together with high time resolution, or on high 
frequency resolution, which leads to only limited time 
resolution, the systems are usually called subband coders or 
transform coders, respectively. 



2.1 Filter Banks 

The following list provides a short overview of the 
most common filter banks used for coding of high-quality 
audio signals: 

1) QMF-Tree Filter Banks: Different frequency reso- 
lution at different frequencies is possible. Typical QMF- 
tree filter banks use from 4 to 24 bands. The computa- 
tional complexity is high. 

2) Polyphase Filter Banks: These are equally spaced 
filter banks which combine the filter design flexibility 
of generalized QMF banks with low computational 
complexity [7]. It is possible to design the prototype filter 
in a way that achieves both good frequency resolution 
(stop-band attenuation better than 96 dB) and good 
control of possible time-domain artifacts. A polyphase filter 
bank using 32 bands is used for Layers I and II of the 
ISO/MPEG audio coder. 

3) DFT, DCT with Sine-Taper Window: These were 
the first transforms used in transform coding of audio 
signals. They implement equally spaced filter banks with 
128-512 bands at a low computational complexity. They 
do not provide critical sampling, that is, the number of 
time-frequency components is greater than the number 
of time samples represented by one block length. 
Another disadvantage of these transforms is possible 
blocking artifacts. 

4) Modified Discrete Cosine Transform (MDCT, using 
time-domain aliasing cancellation as proposed in 
[8]): This transform combines critical sampling with a 
good frequency resolution provided by a sine window 
(compared to a sine-taper window) and the 
computational efficiency of a fast FFT-like algorithm. Typically 
128-512 equally spaced bands are used. 

5) Hybrid Structures (such as polyphase and MDCT): 
Using hybrid structures as first proposed in [9] it is 
possible to combine different frequency resolutions at 
different frequencies with moderate implementation 
complexity. A hybrid system consisting of a polyphase 
filter bank and an MDCT is used in Layer III. 

Fig. 1. (a) Basic structure of ISO/MPEG/Audio encoder, (b) Basic structure of ISO/MPEG/Audio decoder. 

Theoretically MDCT and polyphase filter banks belong 
to the same class of time-frequency domain mappings, 
called lapped orthogonal transforms. 
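
The time-domain aliasing cancellation mentioned for the MDCT can be illustrated with a minimal Python sketch. It applies a direct (matrix-style) MDCT with a sine window to blocks that overlap by 50% and overlap-adds the inverse transforms. The function names, the test signal, the block size of 18 coefficients, and the 2/M scaling convention of the inverse transform are assumptions chosen for this illustration and are not taken from the text of the standard. 

```python
import math


def sine_window(m):
    """Sine window of length 2*m; it satisfies w[n]^2 + w[n+m]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * m)) for n in range(2 * m)]


def _phase(n, k, m):
    """Argument of the MDCT cosine kernel for time index n and line k."""
    return math.pi / m * (n + 0.5 + m / 2) * (k + 0.5)


def mdct(block, window):
    """Map 2*m windowed time samples to m coefficients (critical sampling)."""
    m = len(block) // 2
    return [
        sum(window[n] * block[n] * math.cos(_phase(n, k, m))
            for n in range(2 * m))
        for k in range(m)
    ]


def imdct(coeffs, window):
    """Map m coefficients back to 2*m windowed, still aliased, samples."""
    m = len(coeffs)
    return [
        2.0 / m * window[n]
        * sum(coeffs[k] * math.cos(_phase(n, k, m)) for k in range(m))
        for n in range(2 * m)
    ]


def analysis_synthesis(signal, m):
    """Transform blocks with hop size m and overlap-add the inverses."""
    window = sine_window(m)
    output = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * m + 1, m):
        coeffs = mdct(signal[start:start + 2 * m], window)
        for n, value in enumerate(imdct(coeffs, window)):
            output[start + n] += value
    return output


M = 18  # illustrative block size: 18 coefficients from a 36-sample window
signal = [math.sin(0.37 * n) + 0.5 * math.cos(1.3 * n) for n in range(6 * M)]
restored = analysis_synthesis(signal, M)
# Only samples covered by two overlapping blocks are free of aliasing.
max_error = max(abs(a - b) for a, b in zip(signal[M:-M], restored[M:-M]))
```

Each block of 36 input samples yields only 18 coefficients, yet the interior of the signal is recovered up to rounding error, because the aliasing terms of neighboring blocks cancel in the overlap-add step. 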

3 GENERIC CODING CONCEPT 

In view of a number of totally different applications, 
a concept of a generic coding system was envisioned. 
Depending on the application, three layers of the coding 
system with increasing complexity and performance can 
be used. A standard ISO decoder is able to decode 
bit-stream data which have been encoded in any of the 
layers. There will also be standard ISO Layer X 
decoders, which are able to decode Layer X and the lower 
Layers X - n. Because of the scaling technique used, the 
ISO/MPEG/Audio coding technique can handle a much 
higher dynamic range than Compact Disc or DAT, that is, 
conventional 16-bit PCM. 

In all three layers the input PCM audio signal is con- 
verted from the time domain into a frequency domain. 
This is done by a polyphase filter bank consisting of 32 
subbands [7]. 

In Layers I and II a filter bank creates 32 subband 
representations of the input audio stream, which are then 
quantized and coded under the control of a psycho- 
acoustic model from which a blockwise adaptive bit 
allocation is derived. 

Layer I is a simplified version of the MUSICAM cod- 
ing scheme, most appropriate for consumer applications 
such as digital home recording on tapes, Winchester 
discs, or on magneto-optical disks, that is, for those 
applications for which very low data rates are not 
mandatory. 

Layer II introduces further compression with respect 
to Layer I by redundancy and irrelevance removal on 
the scale factors, and uses more precise quantization. 
Layer II is nearly identical with the MUSICAM scheme 
[10], [11], with the exception of the frame header. This 
header has been added to the MUSICAM frame during 
the ISO/MPEG/Audio development work. Layer II has 
numerous applications in both consumer and 
professional audio, such as audio broadcasting, television, 
recording, telecommunication, and multimedia [12]. 

Layer III consists of a combination of the most effec- 
tive modules of the ASPEC [13] and MUSICAM coding 
schemes. An additional frequency resolution is provided 
by the use of a hybrid filter bank. Every subband is 
thereby further split into higher-resolution frequency 
lines by a linear transform that operates on 18 subband 
samples in each subband. In Layer III, nonuniform quan- 
tization, adaptive segmentation, and entropy coding of 
the quantized values are employed for a better coding 
efficiency. This layer is most appropriate for 
telecommunication, in particular narrow-band ISDN, and for 
professional audio applications that place a high weight 
on very low bit rates. 

Joint stereo coding can be added as an additional fea- 
ture to any of the layers. This technique exploits the 
redundancy and irrelevance of typical stereophonic pro- 
gram material and can be used to increase the audio 
quality at low bit rates or reduce the bit rate for 
stereophonic signals [14], [15]. The increase in encoder 
complexity is small, and the additional decoder 
complexity is negligible. Joint stereo coding does not enlarge 
the overall coding delay. 

3.1 Psychoacoustic Models 

The psychoacoustic model calculates the minimum 
masking threshold necessary to determine the just no- 
ticeable noise level for each band in the filter bank. The 
difference between the maximum signal level and the 
minimum masking threshold is used in the bit or noise 
allocation to determine the actual quantizer level in each 
subband for each block. Two psychoacoustic models are 
given in the informative part of the standard. While they 
can both be applied to any layer of the MPEG/Audio 
algorithm, in practice model 1 will be used for Layers I 
and II, and model 2 for Layer III. In both psychoacoustic 
models the final output of the model is a signal-to-mask 
ratio for each subband (Layers I and II) or group of 
bands (Layer III). The psychoacoustic models are only 
necessary in the encoder. This allows decoders of 
significantly less complexity. It also makes it possible to 
improve the performance of the encoder later, in terms 
of the ratio of bit rate to subjective quality. For some 
applications which do not demand a very low bit rate, it 
is even possible to use a very simple encoder without 
any psychoacoustic model. 

3.1.1 Psychoacoustic Model 1 

A high frequency resolution, that is, small subbands 
in the lower frequency region, and a lower resolution in 
the higher frequency region with wide subbands should 
be the basis for an adequate calculation of the masking 
thresholds in the frequency domain. This would lead to 
a tree structure of the filter bank. The polyphase filter 
network used for the subband filtering has a parallel 
structure which does not provide subbands of different 
widths. Nevertheless, one major advantage of the filter 
bank is given by adapting the audio blocks optimally to 
the requirements of the temporal masking effects and 
inaudible preechoes. The second major advantage is 
given by the small delay and complexity. To compensate 
for the lack of accuracy of the spectrum analysis of the 
filter bank, a 512-point fast Fourier transform (FFT) for 
Layer I, and a 1024-point FFT for Layer II are used in 
parallel to the process of filtering the audio signal into 
32 subbands [16]. The output of the FFT is used to 
determine the relevant tonal, that is, sinusoidal, and 
nontonal, that is, noise maskers, of the actual audio 
signal. It is well known from psychoacoustic research 
that the tonality of a masking component has an influence 
on the masking threshold. For this reason it is 
worthwhile to discriminate between tonal and nontonal 
components. The individual masking thresholds for each masker 
above the absolute masking threshold are calculated 
depending on frequency position, loudness level, and 
tonality. All the individual masking thresholds, including 
the absolute threshold, are added to form the so-called global 
masking threshold. For each subband the minimum 
value of this masking curve is determined. Finally the 
difference between the maximum signal level, calculated 
from both the scale factors and the power density spectrum 
of the FFT, and the minimum masking threshold is 
calculated for each subband and each block. The block 
length for Layer I is determined by 12 subband samples, 
corresponding to 384 input audio PCM samples, and for 
Layer II by 36 subband samples, corresponding to 1152 
input audio PCM samples. This difference of maximum 
signal level and minimum masking threshold is called 
signal-to-mask ratio (SMR) and is the relevant input 
function for the bit allocation. 
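
The last steps of model 1 can be sketched for a single subband as follows. This is a minimal illustration only: it assumes that the thresholds are "added" as powers (not as decibel values), and all function names and numerical levels are hypothetical. 

```python
import math


def power_sum_db(levels_db):
    """Add levels given in dB as powers and return the sum in dB."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))


def global_threshold_db(absolute_db, masker_thresholds_db):
    """Global masking threshold at one frequency line (assumed power sum)."""
    return power_sum_db([absolute_db] + list(masker_thresholds_db))


def signal_to_mask_ratio_db(max_signal_level_db, line_thresholds_db):
    """SMR of a subband: maximum signal level minus minimum threshold."""
    return max_signal_level_db - min(line_thresholds_db)


# Hypothetical subband with three frequency lines. Each entry holds the
# absolute threshold and the individual masker thresholds at that line (dB).
lines = [
    (10.0, [40.0, 40.0]),
    (10.0, [30.0]),
    (12.0, [35.0, 20.0]),
]
thresholds = [global_threshold_db(absolute, maskers) for absolute, maskers in lines]
smr = signal_to_mask_ratio_db(70.0, thresholds)
```

With these example values the second line has the lowest global threshold (about 30 dB), so the SMR of the subband is about 40 dB. 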

3.1.2 Psychoacoustic Model 2 

The frequency-domain representation of the data is 
calculated via FFT with a window length of 1024 sam- 
ples. The calculation is done every 576 samples, that 
is, synchronous to the hybrid filter bank. The separate 
calculation of the frequency-domain representation is 
necessary because the hybrid filter bank values cannot 
easily be used to get a magnitude-phase representation 
of the input sequence. The magnitude-phase representa- 
tion is necessary to calculate the tonality of the current 
input block for every frequency component. 

The tonality estimation works using a simple polyno- 
mial predictor, as described in [9]. The basic idea is to 
use the predictability of the signal as an indicator for 
tonality. The prediction is done in the magnitude-phase 
domain. The values stored from the last two blocks are 
used to predict the magnitude and phase of each 
frequency line for the current block. The Euclidean distance 
between estimated and actual values in the 
magnitude-phase domain is normalized to the maximum 
possible distance. The normalized value is called "chaos 
measure" and can assume values between 0 (the rotating 
phasor prediction has 0 distance from the actual value) 
and 1 (the predicted value has the maximum distance 
from the actual value). A logarithmic mapping is used 
to map the chaos measure range between 0.5 and 0.05 
to tonality values of between 0 and 1. 
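
The tonality estimation can be sketched for one frequency line as follows. The sketch assumes a linear extrapolation of magnitude and phase from the last two blocks as the "simple polynomial predictor", and it derives the logarithmic mapping directly from the two end points given above (0.5 maps to 0, 0.05 maps to 1). The function names are hypothetical, and the constants used in the informative part of the standard may differ slightly. 

```python
import cmath
import math


def chaos_measure(prev2, prev1, actual):
    """Normalized prediction error of one complex frequency line.

    prev2 and prev1 are the values of the line in the last two blocks,
    actual is its value in the current block.
    """
    r2, phi2 = cmath.polar(prev2)
    r1, phi1 = cmath.polar(prev1)
    # Assumed predictor: extrapolate magnitude and phase linearly.
    r_pred = 2.0 * r1 - r2
    phi_pred = 2.0 * phi1 - phi2
    predicted = cmath.rect(r_pred, phi_pred)
    # Largest possible distance between the actual and predicted phasors.
    max_distance = abs(actual) + abs(r_pred)
    if max_distance == 0.0:
        return 0.0
    return abs(actual - predicted) / max_distance


def tonality(chaos):
    """Logarithmic map: chaos 0.5 -> tonality 0, chaos 0.05 -> tonality 1."""
    if chaos >= 0.5:
        return 0.0
    if chaos <= 0.05:
        return 1.0
    return math.log(0.5 / chaos) / math.log(10.0)


# A stationary sinusoid: constant magnitude, phase advancing by 0.3 rad.
steady = [cmath.rect(2.0, 0.1 + 0.3 * block) for block in range(3)]
chaos_tonal = chaos_measure(*steady)
# The same history followed by a value opposite to the prediction.
chaos_noisy = chaos_measure(steady[0], steady[1], -steady[2])
```

A perfectly predictable line gives a chaos measure close to 0 and therefore a tonality of 1, whereas a value opposite to the prediction gives a chaos measure of 1 and a tonality of 0. 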

The magnitude values of the frequency-domain repre- 
sentation are converted to a one-third critical band en- 
ergy representation. A convolution of these values with 
the cochlea spreading function follows. The next step 
in the threshold estimation is the calculation of the just 
masked noise level in the cochlea domain using the to- 
nality index and the convolved spectrum. A correction 
for the dc gain of the convolution has to be applied. The 
last step to get the preliminary estimated threshold is 
the adjustment for the absolute threshold. As the sound 
pressure level of the final audio output is not known in 
advance, the absolute threshold is assumed to be some 
amount below the LSB for the frequencies around 4 kHz. 
A more detailed description of the estimation of the 
masking threshold using spreading convolution can be 
found in [17]. 

The final step in the calculation of the threshold is 
preecho control. Preechoes are audible if the backward 
masking of the signal is not sufficient to mask the error 
signal, which was spread in time due to the limited 
time resolution of the synthesis filter bank. This is only 
possible if there is a sudden increase in signal energy, 
at least for part of the signal bandwidth. From this a 
sufficient (but not necessary) condition for the absence 
of audible preechoes can be derived. The estimated 
masking threshold is restricted not to exceed the prelimi- 
nary estimated threshold of the last block. This condition 
on the final estimated threshold may reduce the estimated 
threshold by a large amount. To keep the actual quanti- 
zation noise below this modified threshold, additional 
bits need to be available to the quantization and coding 
loop. Layer III contains an intelligent buffer manage- 
ment scheme (called bit reservoir) in order to make the 
additional bits available when needed. This technique 
was taken from OCF (see [18]). 

4 LAYER I AND LAYER II CODING SCHEME 

Block diagrams of the Layer I and Layer II encoders 
are given in Fig. 2. The coding technique for these layers 
is based on a subband splitting of the input PCM audio 
signal by a polyphase analysis filter bank into 32 equally 
spaced subbands, a dynamic bit allocation derived from 
a psychoacoustic model, block companding of the sub- 
band samples, and the bit-stream formatting [10], [11], 
[19]. The individual steps of the encoding and decoding 
process are explained in detail in the following 
sections. 

4.1 Filter Bank 

The prototype QMF filter is of order 511. It is 
optimized in terms of spectral resolution and rejection of 
side lobes, which is better than 96 dB. This rejection is 
necessary for a sufficient cancellation of aliasing distor- 
tions. This filter bank provides a reasonable tradeoff 
between temporal behavior on one side and spectral ac- 
curacy on the other. A time-frequency mapping provid- 
ing a high number of subbands facilitates the bit-rate 
reduction due to the fact that the human ear perceives 
the audio information in the spectral domain with a reso- 
lution corresponding to the critical bands of the ear, or 
even lower. These critical bands have a width of about 
100 Hz in the low-frequency region, that is, below 500 
Hz, and widths of about 20% of the center frequency at 
higher frequencies. The requirement of having a good 
spectral resolution is unfortunately contradictory to the 
necessity of keeping the transient impulse response, the 
so-called pre- and postecho, within certain limits in 
terms of temporal position and amplitude compared to 
the attack of a percussive sound. The knowledge of the 
temporal masking behavior [20] gives an indication of 
the necessary temporal position and amplitude of the 
preecho generated by a time-frequency mapping in such 
a way that this preecho, which normally is much more 
critical compared to the postecho, is masked by the origi- 
nal attack. In conjunction with the dual-synthesis filter 
bank located in the decoder, this filter technique pro- 
vides a global transfer function optimized in terms of 
perfect impulse response perception. 
In the decoder the dual-synthesis filter bank 
reconstructs a block of 32 output samples. The filter structure 
is extremely efficient for implementation in a 
low-complexity and non-DSP-based decoder and generally 
requires fewer than 80 integer multiplications or additions 
per PCM output sample. Moreover, the complete 
analysis and synthesis filter gives an overall time delay of 
only 10.5 ms at a 48-kHz sampling rate. 

4.2 Determination and Coding of Scale Factors 

The calculation of the scale factor for each subband 
is performed for a block of 12 subband samples. The 
maximum of the absolute value of these 12 samples is 
determined and quantized with a word length of 6 bits, 
covering an overall dynamic range of 120 dB per 
subband with a resolution of 2 dB per scale factor class. In 
Layer I a scale factor is transmitted for each block and 
each subband whose bit allocation is not zero. 
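
The determination of a scale factor for one block can be sketched as follows. The sketch assumes a table of 63 scale factors of the form 2 * 2^(-i/3), which corresponds to steps of about 2 dB and fits into a 6-bit index; the table, the function name, and the sample values are assumptions made for this illustration. 

```python
import math

# Assumed table: index 0 is the largest scale factor, each further index
# is smaller by a factor of 2^(1/3), that is, by about 2 dB.
SCALE_FACTORS = [2.0 * 2.0 ** (-index / 3.0) for index in range(63)]


def scale_factor_index(subband_samples):
    """Index of the smallest scale factor not below the block peak."""
    peak = max(abs(sample) for sample in subband_samples)
    for index in range(len(SCALE_FACTORS) - 1, -1, -1):
        if SCALE_FACTORS[index] >= peak:
            return index
    return 0  # peak exceeds the table; fall back to the largest entry


# Hypothetical block of 12 subband samples.
block = [0.1, -0.9, 0.3, 0.0, 0.25, -0.4, 0.05, 0.0, 0.6, -0.2, 0.1, 0.0]
index = scale_factor_index(block)
normalized = [sample / SCALE_FACTORS[index] for sample in block]
step_db = 20.0 * math.log10(SCALE_FACTORS[0] / SCALE_FACTORS[1])
```

After division by the selected scale factor all 12 samples lie in the range from -1 to 1, which is the input range assumed by the quantizer. 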

Layer II uses an additional coding to reduce the trans- 
mission rate for the scale factors. Due to the fact that 
in Layer II a frame corresponds to 36 subband samples, 
that is, three times the length of a Layer I frame (see 
Fig. 3), three scale factors have to be transmitted in 
principle. To reduce the bit rate for the scale factors, a 
coding strategy which exploits the temporal masking 
effects of the ear has been studied. Three successive 
scale factors of each subband of one frame are consid- 
ered together and classified into certain scale factor pat- 
terns. Depending on the pattern, one, two, or three scale 
factors are transmitted with an additional scale factor 
select information consisting of 2 bits per subband. If 
there are only small deviations from one scale factor to 
the next, only the larger one has to be transmitted. This 
occurs relatively often for stationary tonal sounds. If 
attacks of percussive sounds have to be coded, two or 
all three scale factors have to be transmitted, depending 
on the rising and falling edge of the attack. This 
additional coding technique allows on average a factor of 2 
reduction in the bit rate for the scale factors compared 
with Layer I. 

Fig. 2. Block diagram of ISO/MPEG/Audio encoder, Layer I and II (single-channel mode). 

4.3 Bit Allocation and Encoding of Bit 
Allocation Information 

Before the adjustment to a fixed bit rate, the number 
of bits available for coding the samples must be 
determined. This number depends on the number of bits 
required for scale factors, scale factor select information, 
bit allocation information, and ancillary data. 

The bit allocation procedure is determined by mini- 
mizing the total noise-to-mask ratio over every subband 
and the entire frame. This procedure is an iterative pro- 
cess where in each iteration step the number of quantiz- 
ing levels of the subband that has the greatest benefit is 
increased with the constraint that the number of bits used 
must not exceed the number of bits available for that 
frame. Layer I uses 4 bits for coding the bit allocation 
information for each subband and frame, whereas Layer 
II uses 4 bits for the lowest subbands only and 2 bits 
for the highest. 
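
The iterative procedure can be sketched as a greedy loop in the style of Layer I. This is a simplified illustration, not the procedure of the standard: it assumes a quantizer SNR of 6.02 dB per bit instead of tabulated values, a cost of 12 sample code words per block plus a 6-bit scale factor when a subband is first activated, and hypothetical function and parameter names. 

```python
def allocate_bits(smr_db, bits_available, samples_per_block=12,
                  max_bits=15, scale_factor_bits=6, snr_per_bit_db=6.02):
    """Greedy bit allocation that lowers the worst noise-to-mask ratio.

    smr_db holds one signal-to-mask ratio per subband. Returns the bits
    per sample for each subband and the total number of bits used.
    """
    allocation = [0] * len(smr_db)
    bits_used = 0
    while True:
        candidates = []
        for subband, smr in enumerate(smr_db):
            if allocation[subband] >= max_bits:
                continue
            # The smallest quantizer has 3 levels and needs 2 bits.
            step = 2 if allocation[subband] == 0 else 1
            cost = step * samples_per_block
            if allocation[subband] == 0:
                cost += scale_factor_bits  # scale factor is now transmitted
            if bits_used + cost <= bits_available:
                # Noise-to-mask ratio under the assumed SNR model.
                nmr = smr - snr_per_bit_db * allocation[subband]
                candidates.append((nmr, subband, step, cost))
        if not candidates:
            break  # no affordable increase is left
        _, subband, step, cost = max(candidates)
        allocation[subband] += step
        bits_used += cost
    return allocation, bits_used


# Hypothetical SMR values for three subbands and a budget of 200 bits.
allocation, bits_used = allocate_bits([30.0, 12.0, -20.0], 200)
```

In this example the subband with the highest SMR receives the most bits, and the subband whose signal already lies far below the masking threshold receives none before the budget is exhausted. 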

4.4 Quantization and Encoding of Subband 
Samples 

First each of the 12 subband samples of one block is 
normalized by dividing its value by the scale factor. The 
result is quantized according to the number of bits spent 
by the bit allocation block. Only odd numbers of 
quantization levels are possible, allowing an exact 
representation of a digital zero. Layer I uses 14 different 
quantization classes, each containing 2^n - 1 quantization 
levels, with 2 ≤ n ≤ 15. This is the same for all 
subbands. In addition no quantization at all can be used 
if no bits are allocated to a subband. In Layer I each 
sample is coded independently by one code word. 

In Layer II the number of different quantization levels 
depends on the subband number, but the range of the 
quantization levels always covers a range of 3 to 65 535 
with the additional possibility of no quantization at all. 
Samples of subbands in the low-frequency region can 
be quantized with 15, in the midfrequency range with 
7, and in the high-frequency range with only 3 different 
quantization levels. The classes may contain 3, 5, 7, 9, 
15, 63, ..., 65 535 quantization levels. Since 3, 5, 
and 9 quantization levels do not allow an efficient use 
of a code word consisting only of 2, 3, or 4 bits, three 
successive subband samples are grouped together to a 
"granule." Then the granule is coded with one code 
word. The coding gain by using the grouping is up to 
37.5%. Due to the fact that many subbands, especially 
in the high-frequency region, are typically quantized 
with only 3, 5, 7, and 9 quantization levels, the reduction 
factor of the length of a code word is considerable. 
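
The grouping of three samples into one code word can be sketched as follows. The sketch assumes that the three quantization indices are combined as digits of a number in base L, where L is the number of quantization levels; the function names and the digit order are assumptions made for this illustration. 

```python
def code_word_bits(levels, samples=3):
    """Bits needed for one code word that covers all sample combinations."""
    return (levels ** samples - 1).bit_length()


def pack_granule(indices, levels):
    """Combine three quantization indices (0 .. levels-1) into one value."""
    first, second, third = indices
    return first + levels * second + levels * levels * third


def unpack_granule(code_word, levels):
    """Recover the three quantization indices from one code word."""
    first = code_word % levels
    second = (code_word // levels) % levels
    third = code_word // (levels * levels)
    return (first, second, third)


# Code word length for the three grouped classes named in the text.
grouped_bits = {levels: code_word_bits(levels) for levels in (3, 5, 9)}
example = pack_granule((2, 0, 1), 3)
```

Three samples coded separately with 2, 3, or 4 bits each need 6, 9, or 12 bits, whereas one grouped code word for 3, 5, or 9 levels needs only 5, 7, or 10 bits under this scheme. 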

4.5 Layer I and Layer II Bit-Stream Structure 

The bit stream of these layers was constructed in such 
a way that both a decoder of low complexity and low 
decoding delay can be used, and that the encoded audio 
signal contains a number of entry points with short and 
constant time intervals. The encoded digital 
representation of an efficient coding algorithm specially suited for 
storage applications must allow multiple entry points 
in the encoded data stream to record, play, and edit 
short audio sequences and to define the editing positions 
precisely. To enable a simple implementation of the 
decoder, the frame between those entry points must contain 
all the information which is necessary for decoding 
the bit stream. Because of the different applications, such a 
frame must in addition carry all the information 
necessary for allowing a large coding range with many 
different parameters. These features are also important in the 
field of digital audio broadcasting, where a 
low-complexity decoder is necessary for economic reasons and 
where frequent entry points in the bit stream are needed, 
allowing an easy block concealment of consecutive 
erroneous samples impaired by burst errors. 

The format of the encoded audio bit stream for Layers 
I and II is shown in Fig. 3. The structure of the bit 
stream is characterized by short autonomous frames cor- 
responding to either 384 PCM samples (8 ms for Layer 
I at 48-kHz sampling rate) or 1152 PCM samples (24 
ms for Layer II at 48 kHz). 
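
The frame durations quoted above follow directly from the frame length and the sampling rate, as the following short sketch shows. The function names and the bit rate of 192 kbit/s are chosen for illustration only. 

```python
def frame_duration_ms(samples_per_frame, sampling_rate_hz):
    """Duration of one frame in milliseconds."""
    return 1000.0 * samples_per_frame / sampling_rate_hz


def frame_size_bits(samples_per_frame, sampling_rate_hz, bit_rate_bps):
    """Average number of bits available per frame at a fixed bit rate."""
    return samples_per_frame * bit_rate_bps / sampling_rate_hz


layer1_ms = frame_duration_ms(384, 48000)
layer2_ms = frame_duration_ms(1152, 48000)
layer2_bits = frame_size_bits(1152, 48000, 192000)
```

At 48 kHz this gives 8 ms for Layer I and 24 ms for Layer II, and a Layer II frame at the assumed 192 kbit/s carries 4608 bits. 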

4.6 Layer I and Layer II Decoding 

The block diagram of the decoder is shown in Fig. 4. 
First of all, the header information, CRC check, side 
information, that is, the bit allocation information with 
scale factors, and 12 successive samples of each subband 
signal are separated from the ISO/MPEG/Audio Layer 
I and II bit stream. The reconstruction process to obtain 
PCM audio again is characterized by filling up the data 
format of the subband samples with regard to the scale 
factor and bit allocation for each subband and frame. 
The synthesis filter bank reconstructs the complete 
broad-band audio signal with a bandwidth of up to 24 
kHz. The decoding process needs significantly less 
computation power than the encoding process. The ratio of 
decoder to encoder computation is about 1:2 for Layer I 
and even 1:3 for Layer II. Because of the low computation 
power needed and the straightforward structure of the 
algorithm, both layers can easily be implemented in a 
dedicated VLSI chip. Since 1993, stereo decoder chips have 
been available from several manufacturers. Layer I and II 
stereo encoders are available which are implemented in 
only one fixed-point DSP (DSP56002). 

5 LAYER III CODING SCHEME

A block diagram of the Layer III encoder is shown in 
Fig. 5. The corresponding decoder is shown in Fig. 6. 
The filter bank used in Layer III is a switched hybrid 
polyphase/MDCT filter bank. In the implementation 
tested by ISO the psychoacoustic model 2 is used to 
estimate the masking threshold. Nonuniform quantization
and Huffman coding are used to increase the coding
efficiency. A buffer technique called bit reservoir is used 
to maintain coding efficiency and to keep the quantiza- 
tion noise below the masking threshold. Some of the 
blocks are explained in more detail in this section. 



5.1 Polyphase/MDCT Hybrid Filter Bank 

This filter bank was designed with the aim of offering 
compatibility with Layer II combined with advanced 
features. The output samples of each of the 32 channels
of the polyphase filter bank used by Layer II are fed into
an 18-channel MDCT filter bank. The maximum total
number of output channels is 32 x 18 = 576. Due to the
small number of channels a direct (matrix multiplication)
implementation of the filter bank can be used without
much penalty in terms of complexity.

Fig. 5. Block diagram of ISO/MPEG/Audio encoder, Layer III (single-channel mode).

The idea of a hybrid filter bank was first used in [9].
It provided different time-frequency resolutions at different
frequencies in order to simulate the frequency-time
resolution of the human auditory system. In Layer
III the resolution is normally kept constant throughout 
the spectrum. This leads to the maximum transform gain 
for stationary signals. If necessary, a part or all of the 
MDCT filter banks can be switched to lower frequency 
resolution and higher time resolution. 

The window length is 36 in the case of long windows 
(overlap factor of 2 used in the MDCT) and 12 in the 
case of short windows. In order to maintain the time- 
domain alias cancellation property of the MDCT the 
number of lines must be a multiple of 4. This is the 
reason why 3:1 switching is used. 
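
The direct (matrix multiplication) form of the MDCT and its time-domain alias cancellation property can be illustrated with a minimal sketch for the long-window case (window length 36, 18 lines per subband). The sine window shape, the scaling of the inverse transform, and all function names are assumptions made for this illustration; the sketch is not the reference implementation.

```python
import math

M = 18  # spectral lines per subband for long blocks; window length is 2 * M = 36


def sine_window(m):
    """Sine window of length 2*m (assumed shape): w[n]^2 + w[n+m]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * m)) for n in range(2 * m)]


def mdct_matrix(m):
    """m x 2m cosine matrix of the MDCT."""
    return [[math.cos(math.pi / m * (n + 0.5 + m / 2.0) * (k + 0.5))
             for n in range(2 * m)] for k in range(m)]


def mdct(block, m=M):
    """Windowed MDCT: 2*m time samples in, m spectral lines out."""
    w, c = sine_window(m), mdct_matrix(m)
    return [sum(c[k][n] * w[n] * block[n] for n in range(2 * m))
            for k in range(m)]


def imdct(lines, m=M):
    """Windowed inverse MDCT: m lines in, 2*m aliased time samples out."""
    w, c = sine_window(m), mdct_matrix(m)
    return [w[n] * (2.0 / m) * sum(c[k][n] * lines[k] for k in range(m))
            for n in range(2 * m)]


def overlap_add(signal, m=M):
    """Analyse and resynthesise with an overlap factor of 2.

    The first and last m samples are covered by only one window and are
    therefore not reconstructed.
    """
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * m + 1, m):
        y = imdct(mdct(signal[start:start + 2 * m], m), m)
        for n in range(2 * m):
            out[start + n] += y[n]
    return out
```

Each block of 36 windowed samples yields only 18 lines, so a single inverse transform contains alias terms. These cancel when the overlapping halves of consecutive blocks are added, which is why the interior of the signal is recovered.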

5.2 Window Switching and Buffer Control 

A window length of 1152 samples corresponds to 24
ms at 48-kHz sampling frequency. All quantization er- 
rors in the frequency domain are spread over this time 
length. For signals containing attacks or similar time- 
domain events (triangle, castanets) this results in audible 
preechoes (see [21]). One method used to avoid pre- 
echoes is based on the possibility of dynamic changes 
in the window shape. This technique is based on the fact 
that alias terms which are caused by subsampling in the 
frequency domain of the MDCT are constrained to either 
half of the window. Adaptive window switching as used 
in Layer III is based on [22]. 

Fig. 7 shows the different windows used in Layer III, 
and Fig. 8 shows a typical sequence of window types if 
adaptive window switching is used. The function of the
different window types is explained as follows:

1) Long Window. This is the normal window type 
used for stationary signals. 

2) Start Window. In order to switch between the long 
and short window types, this hybrid window is used. The 
left half has the same form as the left half of the long 
window type. The right half has the value of 1 for one- 
third of the length and the shape of the right half of a short 
window for one-third of the length. The remaining one- 
third of the window is 0. Thus alias cancellation can be 
obtained for the part that overlaps the short window. 

3) Short Window. The short window has basically the 
same form as the long window, but with one-third of 
the window length. It is followed by an MDCT of one-third
length. The time resolution is enhanced to 4 ms at
48-kHz sampling frequency. The combined frequency 
resolution of the hybrid filter bank in the case of short 
windows is 192 lines, compared to 576 lines for the 
normal windows used in Layer III. 

4) Stop Window. This window type enables the 
switching from short windows back to normal windows. 
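
The four window types described above can be sketched as follows. The sine shape of the long and short windows and the construction of the stop window as the time reversal of the start window are assumptions of this sketch; the text only fixes the lengths and the composition of the start window.

```python
import math


def long_window():
    """36-point sine window (assumed shape of the normal window)."""
    return [math.sin(math.pi / 36.0 * (i + 0.5)) for i in range(36)]


def short_window():
    """12-point window: same form as the long window, one-third the length."""
    return [math.sin(math.pi / 12.0 * (i + 0.5)) for i in range(12)]


def start_window():
    """Left half of the long window; right half is 1, short slope, then 0."""
    left = long_window()[:18]
    right = [1.0] * 6 + short_window()[6:] + [0.0] * 6
    return left + right


def stop_window():
    """Assumed here to be the time reversal of the start window."""
    return start_window()[::-1]
```

With these shapes, the falling slope of the start window and the rising slope of the first short window have squares that sum to 1, which is the condition for alias cancellation in the part where the two overlap.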

A criterion when to switch the window form is neces- 
sary to get the preecho control working. 

The use of a hybrid filter bank facilitates advanced 
preecho control schemes. In the case of a preecho condi- 
tion all or part of the MDCT filter banks are switched 
to a better time resolution, as described. The criterion 
to switch the filter bank is derived from the threshold 
calculation. If preecho control is implemented in the 
threshold calculation as described, preecho conditions 
result in a much increased estimated perceptual entropy 
(PE) [17], that is, in the amount of bits needed to encode
the signal. If the demand for bits exceeds the average 
value by some extent, a preecho condition is assumed 
and the window switching logic activated. Experimental 
data suggest that a big surge in PE is always due to 
preecho conditions. Therefore preecho detection via the 
threshold calculation works without errors. 
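
The switching logic can be sketched as a small decision rule driven by the PE estimate. The threshold value, the one-block look-ahead, and the function name are hypothetical choices for this illustration and are not prescribed by the text.

```python
def plan_window_sequence(pe_values, pe_threshold):
    """Choose a window type per block from perceptual-entropy (PE) estimates.

    A block whose PE exceeds pe_threshold (a hypothetical tuning value) is
    treated as a preecho condition and coded with short windows. Start and
    stop windows are inserted so that long and short windows are never
    adjacent in the returned sequence. One block of look-ahead is assumed.
    """
    want_short = [pe > pe_threshold for pe in pe_values]
    sequence = []
    for i, short_now in enumerate(want_short):
        short_next = i + 1 < len(want_short) and want_short[i + 1]
        previous = sequence[-1] if sequence else "long"
        if short_now:
            current = "short"
        elif previous == "short":
            # Stay short if another preecho condition follows immediately.
            current = "short" if short_next else "stop"
        elif short_next:
            current = "start"
        else:
            current = "long"
        sequence.append(current)
    return sequence
```

For a single surge in PE this yields the succession long, start, short, stop, long.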




Fig. 7. Different types of windows used for MDCT in ISO/MPEG/Audio Layer III. (a) Normal window, (b) Start window, (c)
Short window, (d) Stop window.

Fig. 8. Typical sequence of window types for adaptive window switching in ISO/MPEG/Audio Layer III.

5.3 Quantization

Layer III uses nonuniform quantization. The basic
formula is

    is(i) = nint(((xr(i)/quant)**0.75) - 0.0946)

where

    xr(i)    absolute value of frequency line at index i
    quant    actual quantizer step size
    nint     nearest integer function
    is(i)    quantized absolute value at index i

The quantization is of the midtread type, that is, values 
around zero get quantized to zero and the quantizer is 
symmetric. 

The nonuniform quantization was chosen to imple- 
ment some noise shaping by default. Bigger values are 
quantized less accurately than smaller values. 
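
The basic formula can be tried out with a few lines of code. This is a minimal sketch: the helper names, the separate handling of the sign, and the inverse power law used for reconstruction are assumptions for illustration.

```python
import math


def nint(x):
    """Nearest integer, with halves rounded up."""
    return int(math.floor(x + 0.5))


def quantize(xr, quant):
    """is(i) = nint((xr(i)/quant)**0.75 - 0.0946) for an absolute value xr."""
    return nint((xr / quant) ** 0.75 - 0.0946)


def quantize_signed(x, quant):
    """Midtread, symmetric quantizer: the sign is handled separately."""
    q = quantize(abs(x), quant)
    return -q if x < 0 else q


def dequantize(q, quant):
    """Assumed inverse power law (ignores the small rounding offset)."""
    return math.copysign(abs(q) ** (4.0 / 3.0) * quant, q)
```

Because of the exponent 0.75, the spacing between reconstruction levels grows with the magnitude: dequantize(9, 1.0) - dequantize(8, 1.0) is larger than dequantize(2, 1.0) - dequantize(1, 1.0).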

5.4 Huffman Coding 

The quantized information is coded using several different
coding methods. A series of zeros at high frequencies
is coded by run-length coding. For the next region with
values not exceeding 1 in magnitude a four-dimensional 
Huffman code is applied. The remaining "big values" re- 
gion will be coded with a two-dimensional Huffman 
scheme and can optionally be split into three subregions, 
each having a separately selectable Huffman code table. 
By individually adapting code tables to subregions coding 
efficiency is enhanced, and simultaneously sensitivity 
against transmission errors is decreased. The largest tables 
used for Layer III contain 16 by 16 entries. Larger values 
are coded using an escape mechanism. 
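
The division of the quantized spectrum into the three regions can be sketched as a scan from the highest frequency downward. Stripping zeros in pairs, the function name, and the return convention are assumptions of this sketch; the Huffman tables themselves are not modeled.

```python
def partition_regions(values):
    """Split quantized lines (low to high frequency) into three regions.

    Returns (big_values_end, count1_end): lines [0, big_values_end) form the
    "big values" region coded in pairs, [big_values_end, count1_end) holds
    quadruples with magnitudes of at most 1, and the remaining lines are the
    run of zeros. An even number of lines is assumed.
    """
    count1_end = len(values)
    # Strip the run of zeros at high frequencies, two lines at a time.
    while (count1_end >= 2 and values[count1_end - 2] == 0
           and values[count1_end - 1] == 0):
        count1_end -= 2
    big_values_end = count1_end
    # Strip quadruples whose magnitudes do not exceed 1.
    while (big_values_end >= 4 and
           all(abs(v) <= 1 for v in values[big_values_end - 4:big_values_end])):
        big_values_end -= 4
    return big_values_end, count1_end
```

For the 12 lines [5, -3, 2, 1, 1, 0, -1, 1, 0, 0, 0, 0], the first four lines fall into the big values region, the next four into the region of magnitudes up to 1, and the last four into the run of zeros.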

5.5 Analysis by Synthesis 

The frequency lines are quantized and coded within 
two nested loops. In the first loop the overall quantizer 
step size is adjusted to ensure that the amount of data 
needed for coding the information does not exceed the 
number of bits available for one block. 

In the second (outer) loop the calculated solution is 
evaluated with respect to the psychoacoustic demands 
imposed by masking conditions. This is done in an analy- 
sis-by-synthesis procedure, which compares the actual 
quantization error to the previously calculated masking 
threshold and accordingly adapts the individual weight- 
ing factor of each scale-factor band. 
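
The two nested loops can be sketched as follows. The bit-count function is a hypothetical stand-in for the Huffman coder, and the step-size increment, the amplification factor per iteration, and the termination rule are assumptions chosen for illustration. The input lines are taken to be absolute values.

```python
import math


def quantize(xr, step):
    return int(math.floor((xr / step) ** 0.75 - 0.0946 + 0.5))


def dequantize(q, step):
    return q ** (4.0 / 3.0) * step


def count_bits(qs):
    """Hypothetical stand-in for the Huffman coder: larger values cost more."""
    return sum(2 * q.bit_length() + 1 for q in qs)


def inner_loop(lines, bit_budget, step=1.0):
    """Rate loop: enlarge the step size until the block fits the bit budget."""
    while True:
        qs = [quantize(x, step) for x in lines]
        if count_bits(qs) <= bit_budget or not any(qs):
            return step, qs
        step *= 2.0 ** 0.25


def outer_loop(lines, bands, thresholds, bit_budget, max_iterations=16):
    """Noise-control loop: amplify bands whose noise exceeds the threshold."""
    gains = [1.0] * len(bands)
    for _ in range(max_iterations):
        scaled = list(lines)
        for b, (lo, hi) in enumerate(bands):
            for i in range(lo, hi):
                scaled[i] = lines[i] * gains[b]
        step, qs = inner_loop(scaled, bit_budget)
        noisy = []
        for b, (lo, hi) in enumerate(bands):
            # Compare the actual quantization error with the masking threshold.
            noise = sum((lines[i] - dequantize(qs[i], step) / gains[b]) ** 2
                        for i in range(lo, hi))
            if noise > thresholds[b]:
                noisy.append(b)
        if not noisy or len(noisy) == len(bands):
            break  # all bands are fine, or amplifying every band gains nothing
        for b in noisy:
            gains[b] *= 2.0 ** 0.5
    return gains, step, qs
```

Amplifying a band before quantization makes its lines larger relative to the common step size, so that band is quantized more finely, while the inner loop keeps the total within the bit budget.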



5.6 Bit-Stream Structure 

The bit-stream organization closely follows Layer II. 
The frame length for Layers III and II is identical. Each 
frame of 1152 time-domain samples is subdivided into 
two granules of 576 samples. The frame header (as used 
for all ISO/MPEG/Audio layers) is followed by the side 
information common to all granules. The side informa- 
tion blocks for the granules follow. They are of constant 
length (59 bits each) through all modes. The length of 
the main information for each granule is explicitly contained
as part of the side information. This makes it
very easy to address the ancillary information, which is
located last in every block. The length of the total side 
information as well as the length of the main information 
are always an integer number of bytes. 

5.7 The Bit Reservoir Technique 

The window switching as described does not succeed 
in avoiding all audible preechoes. This is due to the fact 
that even the length of the "short" windows corresponds 
to a time of 8 ms at 48-kHz sampling frequency. Due 
to this fact a combination of the PE-driven preecho control,
as described in Section 3.1, and the window switching
is used to prevent audible preechoes. Unfortunately
there is a large increase in PE, that is, in the amount of
bits needed to code the signal, for the frames where
preecho control is used.

A buffer technique called bit reservoir was introduced 
to satisfy the additional need for bits due to the preecho 
control. It can be described as follows. The amount of 
bits corresponding to a frame is no longer constant, but 
varies with a constant long-term average. To accommodate
fixed-rate channels, a maximum accumulated deviation
of the actual bit rate to the target (mean) bit rate is
allowed. The deviation is always negative, that is, the
actual mean bit rate is never allowed to exceed the channel
capacity. An additional delay in the decoder takes
care of the maximum accumulated deviation from the
target bit rate.

If the actual accumulated deviation from the target bit
rate is zero, then (by definition) it holds that the actual
bit rate equals the target bit rate. In this case the bit
reservoir is called empty. If there is an accumulated
deviation of n bits, then the next frame may use up to
n bits more than the average number without exceeding
the mean bit rate. In this case the bit reservoir is said
to hold n bits.

This is used in the following way in Layer III. Normally
the bit reservoir is kept at somewhat below the
maximum number (accumulated deviation). If there is
a surge in PE due to the preecho control, then additional
bits "taken from the reservoir" are used to code this
particular frame according to the PE demand. In the next
few frames every frame is coded using slightly fewer bits
than the average amount. The bit reservoir gets "filled
up" again.

Figs. 9 and 10 show examples of the succession of
frames with different amounts of bits actually used. A 
pointer called "main data begin" is used to transmit the 
information about the actual accumulated deviation from 
the mean bit rate to the decoder. The side information 
is still transmitted with the frame rate as derived from 
the channel capacity (mean rate). The main data begin 
pointer is used to find the main information in the input 
buffer of the decoder. 
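
The bookkeeping of the bit reservoir can be sketched with a short simulation. The function name, the per-frame demand figures, and the use of bits as the unit of the main data begin pointer are assumptions of this sketch and do not describe the actual bit-stream syntax.

```python
def simulate_bit_reservoir(demands, mean_bits, max_reservoir):
    """Track the bit reservoir over a succession of frames.

    demands        bits each frame would like to use (for example from PE)
    mean_bits      bits per frame at the target (mean) bit rate
    max_reservoir  maximum accumulated deviation allowed, in bits

    Returns one (main_data_begin, bits_used, reservoir_after) tuple per
    frame. In this sketch main_data_begin is simply the reservoir level,
    in bits, at the start of the frame.
    """
    reservoir = 0  # empty: the actual bit rate equals the target bit rate
    frames = []
    for demand in demands:
        main_data_begin = reservoir
        available = mean_bits + reservoir
        used = min(demand, available)  # never exceed the channel capacity
        reservoir = available - used
        if reservoir > max_reservoir:
            # The reservoir is full: spend the surplus in this frame.
            used += reservoir - max_reservoir
            reservoir = max_reservoir
        frames.append((main_data_begin, used, reservoir))
    return frames
```

With a mean of 1000 bits per frame and demands of 900, 900, 1500, and 900 bits, the first two frames each save 100 bits, the third frame uses the full 1200 bits then available, and the reservoir starts to fill again with the fourth frame.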



6 APPLICATIONS OF ISO/MPEG/AUDIO

There is a wide field of applications for low-bit-rate 
audio coding schemes. The applications are in the areas 
of consumer audio as well as professional audio. Any
medium with a channel capacity of 256 kbit/s can be
used to distribute a stereophonic audio program with
either Layer II or Layer III with no subjective degradation
compared to 16-bit PCM. If the channel capacity
allows a bit rate of 384 kbit/s, even the low-complexity 
Layer I can be used for highest audio quality. With bit 
rates of 64 kbit/s per channel or 2*64 kbit/s per stereo 
program MPEG-Audio Layer II and even more Layer 
III provide a subjective audio quality that comes close 
to the original 16-bit PCM for normal audio material. 
In 1992 and 1993, the specialists group TG 10/2 of ITU-R
performed many tests with different types of codecs
for applications such as contribution, distribution, commentary,
and emission. Some of the tests included a cascade
of up to nine codecs. The tests showed that only
the MPEG-Audio layers fulfilled the requirements of
ITU-R. Layer II received the recommendation for contribution,
distribution, and emission, and Layer III was
recommended for commentary application.

The first applications of MPEG-Audio are the following:

• Storage and editing of digital audio on small computers 
(home studio). 

• Computer-based multimedia. 

• Digital audio broadcasting (terrestrial and satellite).
DAB, which was developed by the EUREKA 147 project,
and ADR (ASTRA Digital Radio) use Layer
II for sound coding.

• Multichannel audio for ADTV and HDTV. 

• Audio recording and reproduction with magnetic 
tapes, Winchester disks, magneto-optical disks, or 
solid-state storage media. 

• Transmission via narrow-band ISDN for reporting
links and tele- or videoconferencing.

• Distribution from the studio to transmitter stations and
contribution links between studios via ISDN.

Fig. 9. Layer III bit-stream organization.

[Diagram labels: header (sync, side info) of frames 1 to 4; main_data_begin 1 to 4; main info 1 to 4.]
Fig. 10. Layer III bit-stream organization with peak demand at main information 3 and small demand at main information 2.

730 J. Audio Eng. Soc., Vol. 42, No. 10, 1994 October

7 CONCLUSIONS 

In the last 10 years high-quality audio coding went 
from the first research project to the first commercial 
applications. With MPEG-1 Audio, the first phase of 
the development of high-quality audio coding is finished. 

Perceptual coders using either high-time-resolution 
polyphase filter banks or high-frequency-resolution 
transform filter banks have reached the quality necessary 
for widespread use in broadcasting, telecommunication, 
computer, and consumer applications. 

The new MPEG-1 Audio coder delivers state-of-the-art
performance. A range of coding modes is provided,
from Layer I, which allows a very simple decoder
implementation, to Layer III, which delivers the
best performance at 64 kbit/s per audio channel.